2024-10-29
Two-Factor (Two-Way) Analysis of Variance
We’ve discussed subsets of these data in previous classes.
dm464 <- read_xlsx("c18/data/dm464.xlsx") |>
janitor::clean_names() |>
mutate(across(where(is.character), as_factor)) |>
mutate(id_code = as.character(id_code),
smoke_start = fct_relevel(smoke_start, "Current", "Former"),
smoke_end = fct_relevel(smoke_end, "Current", "Former"),
insur_start = fct_relevel(insur_start, "Medicare",
"Commercial", "Medicaid"),
insur_end = fct_relevel(insur_end, "Medicare",
"Commercial", "Medicaid"))
dim(dm464)[1] 464 42
| Variable | Description |
|---|---|
id_code |
DM-xxxx code to identify subjects (not consecutive) |
bmi_end |
Body Mass Index in \((kg/m^2)\) at the end of the period |
dep_start |
Depression diagnosis at the start of the period (1 = yes, 0 = no) |
ed_start |
High = more than 78% of adult residents of subject’s home neighborhood graduated high school, vs. Low = no more than 78%. |
insur_start |
Subject’s insurance status at the start (3 levels) |
dm18 <- dm464 |>
mutate(ed_start = factor(ifelse(nedlev_start > 78, "High", "Low")),
dep_start = factor(depdiag_start)) |>
select(id_code, bmi_end, dep_start, ed_start, insur_start)
head(dm18)# A tibble: 6 × 5
id_code bmi_end dep_start ed_start insur_start
<chr> <dbl> <fct> <fct> <fct>
1 DM-1007 47.2 0 Low Medicare
2 DM-1017 34.7 0 High Commercial
3 DM-1039 50 1 High Medicare
4 DM-1049 57.2 1 Low Medicaid
5 DM-1064 45.2 1 Low Medicaid
6 DM-1070 34.3 1 High Medicaid
data_codebook() resultsselect(dm18, -id_code) (464 rows and 4 variables, 4 shown)
ID | Name | Type | Missings | Values | N
---+-------------+-------------+----------+------------+------------
1 | bmi_end | numeric | 0 (0.0%) | [16, 78.3] | 464
---+-------------+-------------+----------+------------+------------
2 | dep_start | categorical | 0 (0.0%) | 0 | 295 (63.6%)
| | | | 1 | 169 (36.4%)
---+-------------+-------------+----------+------------+------------
3 | ed_start | categorical | 0 (0.0%) | High | 228 (49.1%)
| | | | Low | 236 (50.9%)
---+-------------+-------------+----------+------------+------------
4 | insur_start | categorical | 0 (0.0%) | Medicare | 193 (41.6%)
| | | | Commercial | 61 (13.1%)
| | | | Medicaid | 210 (45.3%)
--------------------------------------------------------------------
Suppose we want to simultaneously understand the impacts of two factors on BMI at the end of the period, specifically:
dep_start (1 or 0), anded_start (Low or High)and it is possible that the impact of depression diagnosis on BMI may depend on the home neighborhood’s educational attainment, and vice versa (i.e. depression and education may interact.)
Calculate and store the group means.
Now, we’ll plot these means, and look for a substantial interaction (non-parallel lines.)
ggplot(ex1_means, aes(x = dep_start, y = mean_bmi)) +
geom_line(aes(group = ed_start, color = ed_start)) +
geom_point(aes(color = ed_start)) +
scale_color_material() +
labs(title = "Example 1 Interaction Plot",
subtitle = "Do we see substantially non-parallel lines?",
y = "Mean BMI at end of period",
x = "Depression Diagnosis (1 = yes, 0 = no)",
color = "HS Graduation Rate")This model predicts our outcome (bmi_end) using the two factors as main effects only (with no interaction.)
Call:
lm(formula = bmi_end ~ dep_start + ed_start, data = dm18)
Coefficients:
(Intercept) dep_start1 ed_startLow
32.981 3.452 1.555
\[ bmi_{end} = 32.98 + 3.452 (dep_{start} = 1) + 1.555 (ed_{start} = Low) \]
Parameter | Coefficient | SE | 90% CI | t(461) | p
----------------------------------------------------------------------
(Intercept) | 32.98 | 0.65 | [31.91, 34.05] | 50.68 | < .001
dep start [1] | 3.45 | 0.84 | [ 2.06, 4.84] | 4.10 | < .001
ed start [Low] | 1.56 | 0.81 | [ 0.22, 2.89] | 1.92 | 0.056
ed_start value, but only one has a depression diagnosis at the start of the period, then on average the subject with a depression diagnosis will have a BMI that is 3.45 \(kg/m^2\) higher than the subject without such a diagnosis, according to our mainfx model.Parameter | Coefficient | SE | 90% CI | t(461) | p
----------------------------------------------------------------------
(Intercept) | 32.98 | 0.65 | [31.91, 34.05] | 50.68 | < .001
dep start [1] | 3.45 | 0.84 | [ 2.06, 4.84] | 4.10 | < .001
ed start [Low] | 1.56 | 0.81 | [ 0.22, 2.89] | 1.92 | 0.056
mainfx model estimates that if we have two subjects who have the same depression diagnosis status at the start of the period, but live in neighborhoods with different ed_start values (one High and one Low), then on average the subject living in the Low education neighborhood will have a BMI that is 1.56 \(kg/m^2\) higher than the other subject.Parameter | Coefficient | SE | 90% CI | t(461) | p
----------------------------------------------------------------------
(Intercept) | 32.98 | 0.65 | [31.91, 34.05] | 50.68 | < .001
dep start [1] | 3.45 | 0.84 | [ 2.06, 4.84] | 4.10 | < .001
ed start [Low] | 1.56 | 0.81 | [ 0.22, 2.89] | 1.92 | 0.056
mainfx model’s estimate of the mean of bmi_end across all subjects where both dep_start is 0 and ed_start = High.Analysis of Variance Table
Response: bmi_end
Df Sum Sq Mean Sq F value Pr(>F)
dep_start 1 1303 1302.59 17.1020 4.21e-05 ***
ed_start 1 280 280.48 3.6824 0.05561 .
Residuals 461 35113 76.17
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mainfx and its assumptionsmainfx is a “main effects only” model. It assumes there is zero interaction between dep_start and ed_start in predicting our outcome, bmi_end.mainfx makes the usual linear model assumptions (linearity, constant variance, Normality)
mainfxFit the linear model, including an interaction (*) between our factors.
withint model equation?\[ bmi_{end} = 33.4 + 2.3 (dep_{start} = 1) \\ + 0.7 (ed_{start} = Low) \\ + 2.2 (dep_{start} = 1) \times (ed_{start} = Low) \]
dep_start |
ed_start |
Estimated bmi_end |
|---|---|---|
| 0 | High | 33.4 |
| 0 | Low | 33.4 + 0.7 = 34.1 |
| 1 | High | 33.4 + 2.3 = 35.7 |
| 1 | Low | 33.4 + 2.3 + 0.7 + 2.2 = 38.6 |
ed_start?Est. bmi_end |
ed_start = High |
ed_start = Low |
|---|---|---|
dep_start = 0 |
33.4 | 34.1 |
dep_start = 1 |
35.7 | 38.6 |
dep_start = 0, High ed_start yields 33.4, and Low yields 34.1, so the High - Low difference in estimates is -0.7.dep_start = 1, High ed_start yields 35.7, and Low yields 38.6, so the High - Low difference in estimates is -2.9.ed_start depends on dep_start.dep_start = 1 vs. 0?Est. bmi_end |
ed_start = High |
ed_start = Low |
|---|---|---|
dep_start = 1 |
35.7 | 38.6 |
dep_start = 0 |
33.4 | 34.1 |
ed_start is High, then dep_start = 1 yields 35.7, and 0 yields 33.4, so the difference in estimated bmi_end is 2.3.ed_start is Low, then dep_start = 1 yields 38.6, and 0 yields 34.1, so the difference in estimated bmi_end is 4.5.dep_start depends on ed_start.withint modelAnalysis of Variance Table
Response: bmi_end
Df Sum Sq Mean Sq F value Pr(>F)
dep_start 1 1303 1302.59 17.1299 4.152e-05 ***
ed_start 1 280 280.48 3.6884 0.05541 .
dep_start:ed_start 1 133 133.25 1.7523 0.18625
Residuals 460 34979 76.04
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mainfx model?withint# Comparison of Model Performance Indices
Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
------------------------------------------------------------------------------------------------------------------
withint | lm | 0.047 | 0.041 | 8.683 | 8.720 | 0.471 | 0.465 | 0.101 | 57.14%
mainfx | lm | 0.043 | 0.039 | 8.699 | 8.727 | 0.529 | 0.535 | 0.899 | 42.86%
bmi_end, overall.What do you think?
Suppose that we now want to understand the impact on BMI at the end of the period, of:
insur_start (Medicare, Commercial or Medicaid), anded_start (Low or High)and it’s possible that the impact of insurance on BMI may depend on the home neighborhood’s educational attainment, and vice versa (i.e. insurance and education may interact.)
Calculate and store group means.
ex2_means <- dm18 |>
group_by(insur_start, ed_start) |>
summarize(mean_bmi = mean(bmi_end))
ex2_means# A tibble: 6 × 3
# Groups: insur_start [3]
insur_start ed_start mean_bmi
<fct> <fct> <dbl>
1 Medicare High 33.6
2 Medicare Low 34.1
3 Commercial High 35.9
4 Commercial Low 34.4
5 Medicaid High 34.2
6 Medicaid Low 38.2
Now, we’ll plot these means, and look for a substantial interaction (non-parallel lines.)
ggplot(ex2_means, aes(x = insur_start, y = mean_bmi)) +
geom_line(aes(group = ed_start, color = ed_start)) +
geom_point(aes(color = ed_start)) +
scale_color_material() +
labs(title = "Example 2 Interaction Plot",
subtitle = "Do we see substantially non-parallel lines?",
y = "Mean BMI at end of period",
x = "Insurance at start of period",
color = "HS Graduation Rate")mainfx2 <- lm(bmi_end ~ insur_start + ed_start, data = dm18)
withint2 <- lm(bmi_end ~ insur_start * ed_start, data = dm18)
compare_performance(mainfx2, withint2, rank = TRUE)# Comparison of Model Performance Indices
Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
-------------------------------------------------------------------------------------------------------------------
withint2 | lm | 0.037 | 0.027 | 8.727 | 8.784 | 0.758 | 0.747 | 0.048 | 85.71%
mainfx2 | lm | 0.024 | 0.018 | 8.786 | 8.824 | 0.242 | 0.253 | 0.952 | 14.29%
mainfx2 estimates?Parameter | Coefficient | SE | 90% CI | t(460) | p
--------------------------------------------------------------------------------
(Intercept) | 32.85 | 0.78 | [31.56, 34.13] | 42.02 | < .001
insur start [Commercial] | 1.44 | 1.30 | [-0.70, 3.57] | 1.11 | 0.269
insur start [Medicaid] | 2.40 | 0.88 | [ 0.94, 3.85] | 2.72 | 0.007
ed start [Low] | 1.79 | 0.82 | [ 0.44, 3.15] | 2.18 | 0.030
\[ bmi_{end} = 32.9 + 1.4 (Comm.) + 2.4 (Medicaid) + 1.8 (Low Ed.) \]
Est. bmi_end |
Medicare | Commercial | Medicaid |
|---|---|---|---|
ed_start High |
32.9 | 34.3 | 35.3 |
ed_start Low |
34.7 | 36.1 | 37.1 |
withint2 equation?Parameter | Coefficient | SE | 90% CI | t(458) | p
-------------------------------------------------------------------------------------------------
(Intercept) | 33.57 | 0.95 | [32.01, 35.13] | 35.44 | < .001
insur start [Commercial] | 2.31 | 1.84 | [-0.72, 5.34] | 1.26 | 0.210
insur start [Medicaid] | 0.66 | 1.26 | [-1.42, 2.74] | 0.53 | 0.599
ed start [Low] | 0.48 | 1.27 | [-1.61, 2.58] | 0.38 | 0.704
insur start [Commercial] × ed start [Low] | -1.94 | 2.58 | [-6.20, 2.32] | -0.75 | 0.452
insur start [Medicaid] × ed start [Low] | 3.44 | 1.76 | [ 0.55, 6.34] | 1.96 | 0.051
withint2 model equation\[ bmi_{end} = 33.6 + 2.3 Commercial + 0.7 Medicaid + 0.5 (ed_{start} = Low) \\ - 1.9 Commercial \times (ed_{start} = Low) + 3.4 Medicaid \times (ed_{start} = Low) \]
insur_start |
ed_start |
Estimated bmi_end |
|---|---|---|
| Medicare | High | 33.6 |
| Commercial | High | 33.6 + 2.3 = 35.9 |
| Medicaid | High | 33.6 + 0.7 = 34.3 |
| Medicare | Low | 33.6 + 0.5 = 34.1 |
| Commercial | Low | 33.6 + 2.3 + 0.5 - 1.9 = 34.5 |
| Medicaid | Low | 33.6 + 0.7 + 0.5 + 3.4 = 38.2 |
Analysis of Variance Table
Response: bmi_end
Df Sum Sq Mean Sq F value Pr(>F)
insur_start 2 510 254.82 3.2727 0.03879 *
ed_start 1 369 369.45 4.7450 0.02989 *
Residuals 460 35817 77.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Response: bmi_end
Df Sum Sq Mean Sq F value Pr(>F)
insur_start 2 510 254.82 3.3029 0.03765 *
ed_start 1 369 369.45 4.7888 0.02915 *
insur_start:ed_start 2 482 240.88 3.1223 0.04500 *
Residuals 458 35335 77.15
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mainfx2withint2Example 2
bmi_end, overall.What do you think?
dm18a <- dm464 |>
mutate(ed_start = factor(ifelse(nedlev_start > 78, "High", "Low")),
dep_start = factor(depdiag_start)) |>
mutate(bmi_c = center(bmi_start)) |>
select(id_code, bmi_end, dep_start, ed_start, insur_start,
bmi_c, bmi_start)
dm18a |> reframe(lovedist(bmi_start))# A tibble: 1 × 10
n miss mean sd med mad min q25 q75 max
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 464 0 35.2 8.82 33.4 8.01 17.3 28.9 40.4 75.1
# A tibble: 1 × 10
n miss mean sd med mad min q25 q75 max
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 464 0 2.30e-15 8.82 -1.84 8.01 -17.9 -6.34 5.21 39.9
Analysis of Variance Table
Response: bmi_end
Df Sum Sq Mean Sq F value Pr(>F)
bmi_c 1 34213 34213 6413.8007 < 2e-16 ***
dep_start 1 8 8 1.4107 0.23556
insur_start 2 29 14 2.7165 0.06718 .
dep_start:insur_start 2 9 4 0.8135 0.44394
Residuals 457 2438 5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Effect Size for ANOVA (Type I)
Parameter | Eta2 (partial) | 95% CI
-----------------------------------------------------
bmi_c | 0.93 | [0.93, 1.00]
dep_start | 3.08e-03 | [0.00, 1.00]
insur_start | 0.01 | [0.00, 1.00]
dep_start:insur_start | 3.55e-03 | [0.00, 1.00]
- One-sided CIs: upper bound fixed at [1.00].
Analysis of Variance Table
Response: bmi_end
Df Sum Sq Mean Sq F value Pr(>F)
bmi_c 1 34213 34213 6419.0169 < 2e-16 ***
dep_start 1 8 8 1.4119 0.23536
insur_start 2 29 14 2.7187 0.06702 .
Residuals 459 2446 5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Response: bmi_end
Df Sum Sq Mean Sq F value Pr(>F)
bmi_c 1 34213 34213 6414.4440 < 2e-16 ***
insur_start 2 29 15 2.7587 0.06442 .
Residuals 460 2453 5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Comparison of Model Performance Indices
Name | Model | R2 | R2 (adj.) | RMSE | Sigma | AIC weights | AICc weights | BIC weights | Performance-Score
---------------------------------------------------------------------------------------------------------------
fit4 | lm | 0.933 | 0.933 | 2.296 | 2.309 | 0.370 | 0.368 | 0.083 | 60.74%
fit5 | lm | 0.933 | 0.933 | 2.300 | 2.309 | 0.515 | 0.526 | 0.916 | 46.38%
fit3 | lm | 0.934 | 0.933 | 2.292 | 2.310 | 0.114 | 0.106 | 4.09e-04 | 28.57%
fit4 (1/3)fit4 (2/3)fit4 (3/3)R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
Package version:
askpass_1.2.1 backports_1.5.0 base64enc_0.1.3
bayestestR_0.15.0 bit_4.5.0 bit64_4.5.2
blob_1.2.4 broom_1.0.7 bslib_0.8.0
cachem_1.1.0 callr_3.7.6 cellranger_1.1.0
cli_3.6.3 clipr_0.8.0 coda_0.19-4.1
codetools_0.2-20 colorspace_2.1-1 compiler_4.4.1
conflicted_1.2.0 correlation_0.8.6 cpp11_0.5.0
crayon_1.5.3 curl_6.0.0 data.table_1.16.2
datasets_4.4.1 datawizard_0.13.0 DBI_1.2.3
dbplyr_2.5.0 digest_0.6.37 dplyr_1.1.4
dtplyr_1.3.1 easystats_0.7.3 effectsize_0.8.9
emmeans_1.10.5 estimability_1.5.1 evaluate_1.0.1
fansi_1.0.6 farver_2.1.2 fastmap_1.2.0
fontawesome_0.5.2 forcats_1.0.0 fs_1.6.5
gargle_1.5.2 generics_0.1.3 ggplot2_3.5.1
ggrepel_0.9.6 glue_1.8.0 googledrive_2.1.1
googlesheets4_1.1.1 graphics_4.4.1 grDevices_4.4.1
grid_4.4.1 gtable_0.3.6 haven_2.5.4
highr_0.11 hms_1.1.3 htmltools_0.5.8.1
httr_1.4.7 ids_1.0.1 insight_0.20.5
isoband_0.2.7 janitor_2.2.0 jquerylib_0.1.4
jsonlite_1.8.9 knitr_1.49 labeling_0.4.3
lattice_0.22-6 lifecycle_1.0.4 lubridate_1.9.3
magrittr_2.0.3 MASS_7.3-61 Matrix_1.7-0
memoise_2.0.1 methods_4.4.1 mgcv_1.9-1
mime_0.12 modelbased_0.8.9 modelr_0.1.11
multcomp_1.4-26 munsell_0.5.1 mvtnorm_1.3-1
nlme_3.1-164 numDeriv_2016.8.1.1 openssl_2.2.2
parameters_0.23.0 patchwork_1.3.0 performance_0.12.4
pillar_1.9.0 pkgconfig_2.0.3 prettyunits_1.2.0
processx_3.8.4 progress_1.2.3 ps_1.8.1
purrr_1.0.2 R6_2.5.1 ragg_1.3.3
rappdirs_0.3.3 RColorBrewer_1.1.3 Rcpp_1.0.13-1
readr_2.1.5 readxl_1.4.3 rematch_2.0.0
rematch2_2.1.2 report_0.5.9 reprex_2.1.1
rlang_1.1.4 rmarkdown_2.29 rstudioapi_0.17.1
rvest_1.0.4 sandwich_3.1-1 sass_0.4.9
scales_1.3.0 see_0.9.0 selectr_0.4.2
snakecase_0.11.1 splines_4.4.1 stats_4.4.1
stringi_1.8.4 stringr_1.5.1 survival_3.7-0
sys_3.4.3 systemfonts_1.1.0 textshaping_0.4.0
TH.data_1.1-2 tibble_3.2.1 tidyr_1.3.1
tidyselect_1.2.1 tidyverse_2.0.0 timechange_0.3.0
tinytex_0.54 tools_4.4.1 tzdb_0.4.0
utf8_1.2.4 utils_4.4.1 uuid_1.2.1
vctrs_0.6.5 viridisLite_0.4.2 vroom_1.6.5
withr_3.0.2 xfun_0.48 xml2_1.3.6
xtable_1.8-4 yaml_2.3.10 zoo_1.8-12
431 Class 18 | 2024-10-29 | https://thomaselove.github.io/431-2024/